KNN example
- Concept Introduction:
  - KNN (k-nearest neighbors) is a supervised learning algorithm for classification and regression.
  - It works by finding the k nearest data points to a query point and using them for classification or prediction.
- Mathematics Behind KNN:
  - Distance calculation (e.g., Euclidean distance): d(p, q) = √(Σᵢ (pᵢ − qᵢ)²)
  - Decision rule: the majority class among the k nearest neighbors (a from-scratch sketch of both steps follows this list).
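To make those two steps concrete, here is a minimal from-scratch sketch; the training points and labels are made up purely for illustration:

import numpy as np
from collections import Counter

# Toy training data (made-up values, just to illustrate the two steps)
points = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 3.0], [6.0, 5.0], [7.0, 7.0]])
labels = np.array([0, 0, 0, 1, 1])
query = np.array([2.0, 2.0])
k = 3

# Step 1: Euclidean distance from the query to every training point
distances = np.sqrt(((points - query) ** 2).sum(axis=1))

# Step 2: majority vote among the labels of the k nearest points
nearest = np.argsort(distances)[:k]
print(Counter(labels[nearest]).most_common(1)[0][0])  # -> 0 (all three nearest points are Class 0)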
- Step-by-Step Example:
  We'll classify a 2D dataset, show decision boundaries, and explain the classification mathematically.

Here’s the Python code to demonstrate KNN with a graph and a simple dataset:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
# Generate a simple dataset (2D for visualization)
X, y = make_classification(
    n_samples=100,
    n_features=2,
    n_informative=2,
    n_redundant=0,
    n_clusters_per_class=1,
    random_state=42
)
# Split into two classes (easier for visualization)
class_0 = X[y == 0]
class_1 = X[y == 1]
# Plot the dataset
plt.figure(figsize=(8, 6))
plt.scatter(class_0[:, 0], class_0[:, 1], c='blue', label='Class 0')
plt.scatter(class_1[:, 0], class_1[:, 1], c='red', label='Class 1')
plt.title("Simple 2D Dataset for KNN")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.legend()
plt.grid()
plt.show()
# Train KNN on the dataset
k = 5
knn = KNeighborsClassifier(n_neighbors=k)
knn.fit(X, y)
# Plot decision boundary
def plot_decision_boundary(X, y, model, title="Decision Boundary"):
    h = 0.02  # mesh step size
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
    # Predict the class of every point on the mesh, then color the regions
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    plt.figure(figsize=(8, 6))
    plt.contourf(xx, yy, Z, alpha=0.8, cmap=plt.cm.Paired)
    plt.scatter(X[:, 0], X[:, 1], c=y, edgecolor='k', cmap=plt.cm.Paired)
    plt.title(title)
    plt.xlabel("Feature 1")
    plt.ylabel("Feature 2")
    plt.grid()
    plt.show()
plot_decision_boundary(X, y, knn, title=f"KNN Decision Boundary (k={k})")
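# (Illustrative aside, not part of the original walkthrough) The value of k
# controls how smooth the boundary is: k=1 hugs individual training points,
# while a larger k averages over more neighbors. Re-plotting for a couple of
# values of k makes this visible.
for k_try in (1, 15):
    knn_try = KNeighborsClassifier(n_neighbors=k_try).fit(X, y)
    plot_decision_boundary(X, y, knn_try, title=f"KNN Decision Boundary (k={k_try})")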
# Example classification: predict the class of a new query point
query_point = np.array([[0, 0]])
print("Predicted class:", knn.predict(query_point)[0])

# Find the k nearest neighbors of the query point
distances, indices = knn.kneighbors(query_point, return_distance=True)

plt.figure(figsize=(8, 6))
plt.scatter(class_0[:, 0], class_0[:, 1], c='blue', label='Class 0')
plt.scatter(class_1[:, 0], class_1[:, 1], c='red', label='Class 1')
plt.scatter(query_point[:, 0], query_point[:, 1], c='green', label='Query Point', s=100)
# Draw a dashed line from the query point to each of its k nearest neighbors
for i in range(k):
    neighbor = X[indices[0][i]]
    plt.plot(
        [query_point[0, 0], neighbor[0]],
        [query_point[0, 1], neighbor[1]],
        'k--'
    )
plt.title("Query Point and Nearest Neighbors")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.legend()
plt.grid()
plt.show()
Explanation of Code:
- Dataset:
  - A 2D dataset is created using make_classification, with two features and two classes.
  - Points from different classes are visualized in different colors.
- Visualization:
  - The scatter plot shows the distribution of points in the dataset.
- Training:
  - The KNeighborsClassifier is trained with k=5.
- Decision Boundary:
  - The decision boundary is plotted to show how the KNN algorithm splits the feature space.
- Query Point:
  - A query point is shown with dashed lines connecting it to its 5 nearest neighbors.
Mathematical Explanation:
- Distance: For the query point (x, y) = (0, 0), compute the Euclidean distance to every training point (xᵢ, yᵢ), d = √((x − xᵢ)² + (y − yᵢ)²), and keep the 5 smallest.
- Classification Rule: If most of those 5 neighbors belong to Class 0, classify the query point as Class 0, and vice versa.
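As a sanity check of this rule, the snippet below reuses X, y, knn, and query_point from the script above: it pulls the 5 neighbor distances from kneighbors and takes a majority vote with collections.Counter, which should agree with knn.predict:

from collections import Counter

distances, indices = knn.kneighbors(query_point)  # the 5 smallest Euclidean distances
votes = Counter(y[indices[0]])                    # class labels of the 5 nearest neighbors
print("Neighbor distances:", np.round(distances[0], 3))
print("Votes:", dict(votes))
print("Majority vote:", votes.most_common(1)[0][0])
print("knn.predict:", knn.predict(query_point)[0])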